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Abstract 

In many practical applications of clustering, the objects to be clustered 
evolve over time, and a clustering result is desired at each time step. In such 
applications, evolutionary clustering typically outperforms traditional static 
clustering by producing clustering results that reflect long-term trends while 
being robust to short-term variations. Several evolutionary clustering algo- 
rithms have recently been proposed, often by adding a temporal smoothness 
penalty to the cost function of a static clustering method. In this paper, we 
introduce a different approach to evolutionary clustering by accurately track- 
ing the time-varying proximities between objects followed by static cluster- 
ing. We present an evolutionary clustering framework that adaptively esti- 
mates the optimal smoothing parameter using shrinkage estimation, a statis- 
tical approach that improves a naive estimate using additional information. 
The proposed framework can be used to extend a variety of static cluster- 
ing algorithms, including hierarchical, k-means, and spectral clustering, into 
evolutionary clustering algorithms. Experiments on synthetic and real data 
sets indicate that the proposed framework outperforms static clustering and 
existing evolutionary clustering algorithms in many scenarios. 



1 Introduction 



In many practical applications of clustering, the objects to be clustered are ob- 
served at many points in time, and the goal is to obtain a clustering result at each 
time step. This situation arises in appUcations such as identifying communities in 
dynamic social networks ( [FaUcowski et aH |2006[ [Tantipathananandh et aH |2007[ ), 
tracking groups of moving objects (Li et al. 2004[ Carmi et aL] 2009[ ), finding time- 
varying clusters of stocks or currencies in financial markets ([Fenn et aL||2009|), and 
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many other applications in data mining, machine learning, and signal processing. 
Typically the objects evolve over time both as a result of long-term drifts due to 
changes in their statistical properties and short-term variations due to noise. 

A naive approach to these types of problems is to perform static clustering at 
each time step using only the most recent data. This approach is extremely sensitive 
to noise and produces clustering results that are unstable and inconsistent with 
clustering results from adjacent time steps. Subsequently, evolutionary clustering 
methods have been developed, with the goal of producing clustering results that 
reflect long-term drifts in the objects while being robust to short-term variation^ 

Several evolutionary clustering algorithms have recently been proposed by 
adding a temporal smoothness penalty to the cost function of a static clustering 
method. This penalty prevents the clustering result at any given time from deviat- 
ing too much from the clustering results at neighboring time steps. This approach 
has produced evolutionary extensions of commonly used static clustering methods 
such as agglomerative hierarchical clustering (Chakrabarti et al.[ 12006'), k-means 



( [Chakrabarti et al.[|2006p , Gaussian mixture models ( |Zhang et al.[[2009| ), and spec 



tral clustering ( Tang et aL| 2008[ Chi et al. 20091 among others. How to choose 



the weight of the penalty in an optimal manner in practice, however, remains an 
open problem. 

In this paper, we propose a different approach to evolutionary clustering by 
treating it as a problem of tracking followed by static clustering (Section [3]). We 
model the observed matrix of proximities between objects at each time step, which 
can be either similarities or dissimilarities, as a linear combination of a true prox- 
imity matrix and a zero-mean noise matrix. The true proximities, which vary over 
time, can be viewed as unobserved states of a dynamic system. Our approach 
involves estimating these states using both current and past proximities, then per- 
forming static clustering on the state estimates. 

The states are estimated using a restricted class of estimators known as shrink- 
age estimators, which improve a raw estimate by combining it with other infor- 
mation. We develop a method for estimating the optimal weight to place on past 
proximities so as to minimize the mean squared error (MSE) between the true prox- 
imities and our estimates. We call this weight the. forgetting factor . One advantage 
of our approach is that it provides an explicit formula for the optimal forgetting 
factor, unlike existing evolutionary clustering methods. The forgetting factor is es- 
timated adaptively, which allows it to vary over time to adjust to the conditions of 
the dynamic system. 

The proposed framework, which we call Adaptive Forgetting Factor for Evo- 



'The term "evolutionary clustering" has also been used to refer to clustering algorithms motivated 
by biological evolution, which are unrelated to the methods discussed in this paper. 
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lutionary Clustering and Tracking (AFFECT), can extend any static clustering al- 
gorithm that uses pairwise similarities or dissimilarities into an evolutionary clus- 
tering algorithm. It is flexible enough to handle changes in the number of clusters 
over time and to accommodate objects entering and leaving the data set between 
time steps. We demonstrate how AFFECT can be used to extend three popular 
static clustering algorithms, namely hierarchical clustering, k-means, and spectral 
clustering, into evolutionary clustering algorithms (Section |4]). These algorithms 
are tested on several synthetic and real data sets (Section [5]). We find that they not 
only outperform static clustering, but also other recently proposed evolutionary 
clustering algorithms due to the adaptively selected forgetting factor. 

The main contribution of this paper is the development of the AFFECT adap- 
tive evolutionary clustering framework, which has several advantages over existing 
evolutionary clustering approaches: 

1. It involves smoothing proximities between objects over time followed by 
static clustering, which enables it to extend any static clustering algorithm 
that takes a proximity matrix as input to an evolutionary clustering algorithm. 

2. It provides an explicit formula and estimation procedure for the optimal 
weight (forgetting factor) to apply to past proximities. 

3. It outperforms static clustering and existing evolutionary clustering algo- 
rithms in several experiments with a minimal increase in computation time 
compared to static clustering (if a single iteration is used to estimate the for- 
getting factor). 



This paper is an extension of our previous work (Xu et al.] |2010[), which was 



limited to evolutionary spectral clustering. In this paper, we extend the previously 
proposed framework to other static clustering algorithms. We also provide addi- 



tional insight into the model assumptions in Xu et al. (20101 and demonstrate the 



effectiveness of AFFECT in several additional experiments. 



2 Background 

2.1 Static clustering algorithms 

We begin by reviewing three commonly used static clustering algorithms. We 
demonstrate the evolutionary extension of these algorithms in Section |4j although 
the AFFECT framework can be used to extend many other static clustering algo- 
rithms. The term "clustering" is used in this paper to refer to both data clustering 
and graph clustering. The notation i G c is used to denote object i being assigned to 
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1: Assign each object to its own cluster 
2: repeat 

3: Compute dissimilarities between each pair of clusters 
4: Merge clusters with the lowest dissimilarity 
5: until all objects are merged into one cluster 
6: return dendrogram 

Figure 1 : A general agglomerative hierarchical clustering algorithm. 



cluster c. |c| denotes the number of objects in cluster c, and C denotes a clustering 
result (the set of all clusters). 

In the case of data clustering, we assume that the n objects in the data set are 
stored in an n x p matrix X, where object i is represented by a p-dimensional 
feature vector Xj corresponding to the ith row of X. From these feature vectors, 
one can create a proximity matrix W, where Wij denotes the proximity between 
objects i and j, which could be their Euclidean distance or any other similarity or 
dissimilarity measure. 

For graph clustering, we assume that the n vertices in the graph are represented 
by an n X n adjacency matrix W where w-ij denotes the weight of the edge between 
vertices i and j. If there is no edge between i and j, then Wij = 0. For the usual 
case of undirected graphs with non-negative edge weights, an adjacency matrix is 
a similarity matrix, so we shall refer to it also as a proximity matrix. 



2.1.1 Agglomerative hierarchical clustering 

Agglomerative hierarchical clustering algorithms are greedy algorithms that create 
a hierarchical clustering result, often represented by a dendrogram CHastie et alj 



2001 1. The dendrogram can be cut at a certain level to obtain a flat clustering re- 
sult. There are many variants of agglomerative hierarchical clustering. A general 
algorithm is described in Fig. [T] Varying the definition of dissimilarity between 
a pair of clusters often changes the clustering results. Three common choices are 
to use the minimum dissimilarity between objects in the two clusters (single link- 
age), the maximum dissimilarity (complete linkage), or the average dissimilarity 
(average linkage) ( |Hastie et"aL]|2001| l. 
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i ^ 

(^(0) vector of random integers in {1, ... ,k} 
Compute similarity matrix W = XX^ 
repeat 

i ^ i + 1 

Calculate squared distance between all objects and centroids using (|2]) 
Compute C*-*^ by assigning each object to its closest centroid 

until CW = C(*-i) 

return C*^*) 



Figure 2: Pseudocode for k-means clustering using similarity matrix W. 



2.1.2 k-means 



k-means clustering (MacQueen 1967 Hastieetal. 2001 1 attempts to find clusters 
that minimize the sum of squares cost function 



P(X,C) = ^J^||x, 



(1) 



where 



c=l igc 

denotes the ^2 -norm, and rric is the centroid of cluster c, given by 



Each object is assigned to the cluster with the closest centroid. The cost of a 
clustering result C is simply the sum of squared Euclidean distances between each 
object and its closest centroid. The squared distance in ([T]) can be rewritten as 



X,; 



Wii fn \ • 



(2) 



where ^, the dot product of the feature vectors. Using the form of Q to 

compute the k-means cost in ([T]) allows the k-means algorithm to be implemented 
with only the similarity matrix W = [ttJjj]"^^^ consisting of all pairs of dot prod- 
ucts, as described in Fig.|2] 



2.1.3 Spectral clustering 



Spectral clustering ( |Shi and MaEk} [20001 |Ng et al.[ [200T] |von Luxburg] [20071 ) is 
a popular modern clustering technique inspired by spectral graph theory. It can be 
used for both data and graph clustering. When used for data clustering, the first step 
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1: Z -(^ k smallest eigenvectors of C 
2: for i = 1 to n do 

3: Zi ^ Zi/||zj|| {Normalize each row of Z to have unit norm} 

4: end for 

5: C -(r- kmeans(Z) 

6: return C 

Figure 3: Pseudocode for normalized cut spectral clustering. 



in spectral clustering is to create a similarity graph with vertices corresponding to 
the objects and edge weights corresponding to the similarities between objects. We 
represent the graph by an adjacency matrix W with edge weights Wij given by a 
positive definite similarity function s(xj, Xj). The most commonly used similarity 
function is the Gaussian similarity function s(xj, Xj) = exp{ — ||xj — x^ |p/ (2p^)} 
( Ng et al.] 200 where p is a scaling parameter. Let D denote a diagonal matrix 



with elements corresponding to row sums of W. Define the unnormalized graph 



Lapla cian matrix by L = D — W and the normalized Laplacian matrix ( [Chung 



1997| ) by £ = / - D-i/2|y/)-i/2 

Three common variants of spectral clustering are average association (AA), 
ratio cut (RC), and normahzed cut (NC) ( |Shi and Malik[ [2000^ . Each variant is 
associated with an NP-hard graph optimization problem. Spectral clustering solves 



relaxed versions of these problems. The relaxed problems can be written as ( von 
LuxburgI |2007| |Chi et al. [ |2009] l 



AA(Z) = max tr(Z^ WZ) subject to Z^ Z = I (3) 



RC{Z) = min tr(Z^LZ) subject to Z^ Z = I (4) 



NC(Z) = min tr(Z^ CZ) subject to Z^ Z = I. (5) 

These are variants of a trace optimization problem; the solutions are given by a 
generalized Rayleigh-Ritz theorem ( jLutkepohl] 1997| ). The optimal solution to Q 



consists of the matrix containing the eigenvectors corresponding to the k largest 
eigenvalues of W as columns. Similarly, the optimal solutions to Q and ([5]) con- 
sist of the matrices containing the eigenvectors corresponding to the k smallest 
eigenvalues of L and £, respectively. The optimal relaxed solution Z is then dis- 
cretized to obtain a clustering result, typically by running the standard k-means 
algorithm on the rows of Z or a normalized version of Z. 

An algorithm ( |Ng et al. 200 1| ) for normalized cut spectral clustering is shown 



in Fig. [3] To perform ratio cut spectral clustering, compute eigenvectors of L 
instead of C and ignore the row normalization in steps |2]-^ Similarly, to perform 
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average association spectral clustering, compute instead the k largest eigenvectors 
of W and ignore the row normalization in steps [2]-|4] 



2.2 Related work 

We now summarize some contributions in the related areas of incremental and 
constrained clustering, as well as existing work on evolutionary clustering. 



2.2.1 Incremental clustering 

The term "incremental clustering" has typically been used to describe two types of 
clustering problem^ 

1. Sequentially clustering objects that are each observed only once. 

2. Clustering objects that are each observed over multiple time steps. 

Type [T] is also known as data stream clustering, and the focus is on clustering the 
data in a single pass and with limited memory ( Charikar et al. 2004} Gupta and 
Grossman 2004[ ). It is not directly related to our work because in data stream 



clustering each object is observed only once. 

Type[2]is of greater relevance to our work and targets the same problem setting 
as evolutionary clustering. Several incremental algorithms of this type have been 
proposed ( jLi et al.[ |2004[ |Sun et al.[ |2007[ |Ning et al.[ |2010| ). These incremental 
clustering algorithms could also be apphed to the type of problems we consider; 
however, the focus of incremental clustering is on low computational cost at the 
expense of clustering quality. The incremental clustering result is often worse than 
the result of performing static clustering at each time step, which is already a sub- 
optimal approach as mentioned in the introduction. On the other hand, evolutionary 
clustering is concerned with improving clustering quality by intelligently combin- 
ing data from multiple time steps and is capable of outperforming static clustering. 



2.2.2 Constrained clustering 

The objective of constrained clustering is to find a clustering result that optimizes 
some goodness-of-fit objective (such as the k-means sum of squares cost function 
([1])) subject to a set of constraints. The constraints can either be hard or soft. Hard 
constraints can be used, for example, to specify that two objects must or must not 
be in the same cluster ( Wagstaff et al.[[2C)01[ Wang and Davidson 2010 1. On the 



^It is also sometimes used to refer to the simple approach of performing static clustering at each 
time step. 
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other hand, soft constraints can be used to specify real-valued preferences, which 
may be obtained from labels or other prior information ( Ji and Xu} [2006 Wang 



and Davidson! 2010"). These soft constraints are similar to evolutionary clustering 



in that they bias clustering results based on additional information; in the case of 
evolutionary clustering, the additional information could correspond to historical 
data or clustering results. 

[Tadepalli et al.| (2009 1 considered the problem of clustering time-evolving ob- 



jects such that objects in the same cluster at a particular time step are unlikely to 
be in the same cluster at the following time step. Such an approach allows one 
to divide the time series into segments that differ significantly from one another. 
Notice that this is the opposite of the evolutionary clustering objective, which fa- 



vors smooth evolutions in cluster memberships over time. [Hossain et al. (20101 



proposed a framework that unifies these two objectives, which are referred to as 
disparate and dependent clustering, respectively. Both can be viewed as clustering 
with soft constraints to minimize or maximize similarity between multiple sets of 
clusters, e.g. clusters at different time steps. 

2.2.3 Evolutionary clustering 

The topic of evolutionary clustering has attracted significant attention in recent 



years. Chakrabarti et al. (2006 1 introduced the problem and proposed a general 
framework for evolutionary clustering by adding a temporal smoothness penalty to 
a static clustering method. Evolutionary extensions for agglomerative hierarchical 
clustering and k-means were presented as examples of the framework. 

[Chi et al. ( 2009| l expanded on this idea by proposing two frameworks for evo- 



lutionary spectral clustering, which they called Preserving Cluster Quality (PCQ) 
and Preserving Cluster Membership (PCM). Both frameworks proposed to opti- 
mize the modified cost function 

C'total = Q Ctemporal + (! — «) Csnapshot; (6) 

where Csnapshot denotes the static spectral clustering cost, which is typically taken 
to be the average association, ratio cut, or normalized cut as discussed in Section 



2. 1.3 The two frameworks differ in how the temporal smoothness penalty Ctemporal 
is defined. In PCQ, Ctemporal is defined to be the cost of applying the clustering 
result at time t to the similarity matrix at time t — 1. In other words, it penalizes 
clustering results that disagree with past similarities. In PCM, Ctemporal is defined 
to be a measure of distance between the clustering results at time t and t — 1. In 
other words, it penalizes clustering results that disagree with past clustering results. 
Both choices of temporal cost are quadratic in the cluster memberships, similar to 
the static spectral clustering cost as in ([3])-([5]l, so optimizing (|6]l in either case is 
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simply a trace optimization problem. For example, the PCQ average association 
evolutionary spectral clustering problem is given by 



max a tr (z'^W^-^Z) + (1 - q) tr (z'^W^Z) subject to Z^Z = /, 



where and ^ denote the adjacency matrices at times t and t — 1, respec- 
tively. The PCQ cluster memberships can be found by computing eigenvectors of 



aW + (1 — a)W and then discretizing as discussed in Section 2.1.3 Our work 



takes a different approach than that of Chi et al. ( 2009| ) but the resulting framework 
shares some similarities with the PCQ framework. In particular, AFFECT paired 
with average association spectral clustering is an extension of PCQ to longer his- 



tory, which we discuss in Section 4.3 



Following these works, other evolutionary clustering algorithms that attempt to 
optimize the modified cost function defined in (|6]) have been proposed ( |Tang et al 



2008| |Lin et &\.\ [20091 [Zhang et aLj [2009| [Mucha et al] [20T0[ ). The definitions of 
snapshot and temporal cost and the clustering algorithms vary by approach. None 
of the aforementioned works addresses the problem of how to choose the parameter 
a in ([6]l, which determines how much weight to place on historic data or clustering 
results. It has typically been suggested ( Chi et alij 2009 ; Lin et al. 2009[ ) to choose 
it in an ad-hoc manner according to the user's subjective preference on the temporal 
smoothness of the clustering results. 

It could also be beneficial to allow a to vary with time. Zhang et al.| ( 2009^ 
proposed to choose a adaptively by using a test statistic for checking dependency 
between two data sets ( Gretton et al.[[2007l ). However, this test statistic also does 
not satisfy any optimality properties for evolutionary clustering and still depends 
on a global parameter reflecting the user's preference on temporal smoothness, 
which is undesirable. 



The existing method that is most similar to AFFECT is that of Rosswog and 



Ghose (2008 1, which we refer to as RG. The authors proposed evolutionary exten- 
sions of k-means and agglomerative hierarchical clustering by filtering the feature 
vectors using a Finite Impulse Response (FIR) filter, which combines the last / + 1 
measurements of the feature vectors by the weighted sum y ■ = 6ox* + ^ix*^^ + 
• • • + h/x*^', where I is the order of the filter, y* is the filter output at time t, and 
bo, . . . ,bi are the filter coefficients. The proximities are then calculated between the 
filter outputs rather than the feature vectors. The main resemblance between RG 
and AFFECT is that RG is also based on tracking followed by static clustering. In 
particular, RG adaptively selects the filter coefficients based on the dissimilarities 
between cluster centroids at the past I time steps. However, RG cannot accommo- 
date varying numbers of clusters over time nor can it deal with objects entering and 
leaving at various time steps. It also struggles to adapt to changes in clusters, as we 
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demonstrate in Section 5.2 AFFECT, on the other hand, is able to adapt quickly 
to changes in clusters and is applicable to a much larger class of problems. 

Finally, there has also been recent interest in model-based evolutionary cluster- 
ing. In addition to the aforementioned method involving mixtures of exponential 
families ([Zhang et al.[|2009]), methods have also been proposed using semi-Markov 



2008 


Xu et al.| 2008b I, hierarchical DPMs ( Xu et al. 2008b|a 


Zhang et al.n2010l). 


and smooth plaid models (|Mankad et al.j 201 1 


1. For these methods, the temporal 



evolution is controlled by hyperparameters that can be estimated in some cases. 



3 Proposed evolutionary framework 

The proposed framework treats evolutionary clustering as a tracking problem fol- 
lowed by ordinary static clustering. In the case of data clustering, we assume 
that the feature vectors have already been converted into a proximity matrix, as 



discussed in Section 2.1 We treat the proximity matrices, denoted by W^, as re- 
alizations from a non-stationary random process indexed by discrete time steps, 
denoted by the superscript t. We assume, like many other evolutionary cluster- 
ing algorithms, that the identities of the objects can be tracked over time so that the 
rows and columns of correspond to the same objects as those of W^~^ provided 
that no objects are added or removed (we describe how the proposed framework 



handles adding and removing objects in Section 4.4.1 1. Furthermore we posit the 
linear observation model 

W^ = ¥ + N\ t = 0,1,2,... (7) 

where is an unknown deterministic matrix of unobserved states, and is a 
zero-mean noise matrix. changes over time to reflect long-term drifts in the 
proximities. We refer to as the true proximity matrix, and our goal is to ac- 
curately estimate it at each time step. On the other hand, A^* reflects short-term 
variations due to noise. Thus we assume that A^*, A^*~^, . . . , are mutually in- 
dependent. 

A common approach for tracking unobserved states in a dynamic system is to 
use a Kalman filter ( |Harvey[ I989| Haykin[ 200 1[ ) or some variant. Since the states 



correspond to the true proximities, there are O(n^) states and O(n^) observations, 
which makes the Kalman filter impractical for two reasons. First, it involves speci- 
fying a parametric model for the state evolution over time, and secondly, it requires 
the inversion of an O(n^) x O(n^) covariance matrix, which is large enough in 
most evolutionary clustering applications to make matrix inversion computation- 
ally infeasible. We present a simpler approach that involves a recursive update of 
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the state estimates using only a single parameter a*, which we define in ([8]). 
3.1 Smoothed proximity matrix 

If the true proximity matrix ^* is known, we would expect to see improved cluster- 
ing results by performing static clustering on rather than on the current proxim- 
ity matrix because is free from noise. Our objective is to accurately estimate 

at each time step. We can then perform static clustering on our estimate, which 
should also lead to improved clustering results. 

The naive approach of performing static clustering on at each time step 
can be interpreted as using itself as an estimate for ^f*. The main disadvantage 
of this approach is that it suffers from high variance due to the observation noise 
A^*. As a consequence, the obtained clustering results can be highly unstable and 
inconsistent with clustering results from adjacent time steps. 

A better estimate can be obtained using the smoothed proximity matrix de- 
fined by 

I-* = a*^^-^ + (1 - a^)W^ (8) 

for t > 1 and by = W^. Notice that is a function of current and past data 
only, so it can be computed in the on-line setting where a clustering result for time 
t is desired before data at time t + 1 can be obtained. ^* incorporates proximities 
not only from time t — 1, but potentially from all previous time steps and allows 
us to suppress the observation noise. The parameter a* controls the rate at which 
past proximities are forgotten; hence we refer to it as the. forgetting factor. The 
forgetting factor in our framework can change over time, allowing the amount of 
temporal smoothing to vary. 



3.2 Shrinkage estimation of true proximity matrix 



The smoothed proximity matrix is a natural candidate for estimating It is 
a convex combination of two estimators: and Since A^* is zero-mean, 

is an unbiased estimator but has high variance because it uses only a single 
observation. is a weighted combination of past observations so it should 

have lower variance than W^, but it is likely to be biased since the past proximities 
may not be representative of the current ones as a result of long-term drift in the 
statistical properties of the objects. Thus the problem of estimating the optimal 
forgetting factor a* may be considered as a bias-variance trade-off problem. 

A similar bias-vaiiance trade-off has been investigated in the problem of shrink- 
age estimation of covariance matrices ( [Ledoit and Wolf[ |2003t [Schafer and Strim 



mer 



2005t|Chen et al.[|2010), where a shrinkage estimate of the covariance matrix 



is taken to be E = AT+ (1 — A)S', a convex combination of a suitably chosen target 
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matrix T and the standard estimate, the sample covariance matrix S. Notice that 
the shrinkage estimate has the same form as the smoothed proximity matrix given 
by ([8]) where the smoothed proximity matrix at the previous time step ^I'*"^ cor- 
responds to the shrinkage target T, the current proximity matrix corresponds 
to the sample covariance matrix S, and a* corresponds to the shrinkage intensity 
A. We derive the optimal choice of a* in a manner similar to Ledoit and Wolf's 
derivation of the optimal A for shrinkage estimation of covariance matrices ( |Ledoit| 
and Wolf] |2003] ). 

As in [Ledoit and Wolf| ( |2003l ), [Schafer and Strimmerl ( |2005] ), and [Chen et"aL 



(2010), we choose to minimize the squared Frobenius norm of the difference be- 
tween the true proximity matrix and the smoothed proximity matrix. That is, we 
take the loss function to be 



L(a*) 



i=i j=i 

We define the risk to be the conditional expectation of the loss function given all 
of the previous observations 



E 







2 








F 





where I^^*"^) denotes the set T^*"^, . . . , W^}. Note that the risk function 

is differentiable and can be easily optimized if is known. However, ^* is the 
quantity that we are trying to estimate so it is not known. We first derive the 
optimal forgetting factor assuming it is known. We shall henceforth refer to this as 
the oracle forgetting factor. 

Under the linear observation model of ([7|l, 



E 



var (w^W^^-'^A = var {W^) = var (iV*) 



(9) 

because A^*, A*^^, . . . , are mutually independent and have zero mean. From 
the definition of in ([8]), the risk can then be expressed as 



R[a*) 



T.T. 

i=i j=i 

n n 



E 



^j;^var(a*V^-^+(l-aO-*,-^i, 
i=i j=i 



(11) 



+ E 



a 
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( fTTj) can be simplified using Q and ( [T0| ) and by noting that the conditional variance 



of ip-- is zero and that tpj, is deterministic. Thus 



n n ^ 2") 

^(«*) = EE (i-«*)'--rK) + («0'(v^r'-V'*,) . (12) 

i=l 7 = 1 J 



From ( 12 1, the first derivative is easily seen to be 



^'(«*) = 2 E E - 1) (4) + (V'^' - 
i=i i=i 

To determine the oracle forgetting factor (a*)*, simply set i?'(a*) = 0. Rearrang- 
ing to isolate a*, we obtain 

n n 

(«T = n n T- (13) 

EE (V^S"^-4)'+var(n*,) 
i=i j=i ^ 

We find that (q*) * does indeed minimize the risk because R" (a*) > for all a*. 

The oracle forgetting factor (a*)* leads to the best estimate in terms of mini- 
mizing risk but is not implementable because it requires oracle knowledge of the 
true proximity matrix ^f*, which is what we are trying to estimate, as well as the 



noise variance var (iV*)- It was suggested in Schafer and Strimmer (2005 1 to re- 
place the unknowns with their sample equivalents. In this setting, we would replace 
iplj with the sample mean of wlj and var(n-j) = vaT{wlj) with the sample vari- 
ance of wjj. However, and potentially var [N^] are time- varying so we cannot 
simply use the temporal sample mean and variance. Instead, we propose to use the 
spatial sample mean and variance. Since objects belong to clusters, it is reasonable 
to assume that the structure of and var (A^*) should reflect the cluster member- 
ships. Hence we make an assumption about the structure of ^* and var (A^*) in 
order to proceed. 



3.3 Block model for true proximity matrix 

We propose a block model for the true proximity matrix and var (iV*) and use 
the assumptions of this model to compute the desired sample means and variances. 
The assumptions of the block model are as follows: 

1. ipji = ij^^jj for any two objects i, j that belong to the same cluster. 
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Figure 4: Block structure of true proximity matrix ^/j^^-^ denotes -0*^ for all 
objects i in cluster c, and V'(cd) denotes for all distinct objects i,j such that i is 
in cluster c and j is in cluster d. 

2. = Tpj^ for any two distinct objects i, j and any two distinct objects /, m 
such that i, I belong to the same cluster, and j, m belong to the same cluster. 

The structure of the true proximity matrix under these assumptions is shown in 
Fig. |4] In short, we are assuming that the true proximity is equal inside the clusters 
and different between clusters. We make the assumptions on var (A^*) that we do 
on namely that it also possesses the assumed block structure. 

One scenario where the block assumptions are completely satisfied is the case 
where the data at each time t are realizations from a dynamic Gaussian mixture 
model (GMM) ( jCarmi et al.[ |2009[ l, which is described as follows. Assume that 
the k components of the dynamic GMM are parameterized by the k time-varying 
mean vectors {/^c}c=i covariance matrices {S*}J^^.Let {(/.J^^^ denote the 
mixture weights. Objects are sampled in the following manner: 

1. (Only at t = 0) Draw n samples {zi}^^i from the categorical distribution 
specified by {cl)c}c=i to determine the component membership of each ob- 
ject. 

2. (For all t) For each object i, draw a sample x* from the Gaussian distribution 
parameterized by (/x* . , .) . 

Notice that while the parameters of the individual components change over time, 
the component memberships do not, i.e. objects stay in the same components over 
time. The dynamic GMM simulates clusters moving in time. In Appendix [Aj we 
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show that at each time t, the mean and variance of the dot product similarity ma- 
trix W^, which correspond to ^* and var (A^*) respectively under the observation 
model of (|7]l, do indeed satisfy the assumed block structure. This scenario forms 



the basis of the experiment in Section 5. 1 



Although the proposed block model is rather simplistic, we believe that it is a 
reasonable choice when there is no prior information about the shapes of clusters. 
A similar block assumption has also been used in the dynamic stochastic block 
model ( Yang et aL| 201 1| ), developed for modeling dynamic social networks. A 
nice feature of the proposed block model is that it is permutation invariant with 
respect to the clusters; that is, it does not require objects to be ordered in any 
particular manner. The extension of the proposed framework to other models is 
beyond the scope of this paper and is an area for future work. 



3.4 Adaptive estimation of forgetting factor 

Under the block model assumption, the means and variances of proximities are 
identical in each block. As a result, we can sample over all proximities in a block 
to obtain sample means and variances. Unfortunately, we do not know the true 
block structure because the cluster memberships are unknown. 

To work around this problem, we estimate the cluster memberships along with 
(a*) in an iterative fashion. First we initialize the cluster memberships. Two log- 
ical choices are to use the cluster memberships from the previous time step or the 
memberships obtained from performing static clustering on the current proximities. 
We can then sample over each block to estimate the entries of and var (A^*) as 
detailed below, and substitute them into ( [T3] ) to obtain an estimate (a*)* of (a*)*- 
Now substitute (a*)* into ^ and perform static clustering on to obtain an up- 
dated clustering result. This clustering result is then used to refine the estimate of 
(a*) , and this iterative process is repeated to improve the quality of the clustering 
result. We find, empirically, that the estimated forgetting factor rarely changes after 
the third iteration and that even a single iteration often provides a good estimate. 

To estimate the entries of = E [W^*] , we proceed as follows. For two 
distinct objects i and j both in cluster c, we estimate ip\ - using the sample mean 

^ Hj\ = id(id-i) ^ ^ 

Similarly, we estimate i/jj- by 
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1: C* ^ C*-l 



for i = 1, 2, ... do {iteration number} 
Compute E [W^] and var (VF*) using C* 

Calculate (a*)* by substituting estimates E [W^] and var (W^) into ( [T3] ) 
^ (a*)* + [!-(««)*] W' 

C* ^ cluster (^*) 
end for 
return C* 



Figure 5: Pseudocode for generic AFFECT evolutionary clustering algorithm. 
Cluster (•) denotes any static clustering algorithm that takes a similarity or dis- 
similarity matrix as input and returns a flat clustering result. 

For distinct objects i in cluster c and j in cluster d with d,'we. estimate - by 



a-' 

l^c mda 



var (iV*) = var (PV^*) can be estimated in a similar manner by taking unbiased 
sample variances over the blocks. 



4 Evolutionary algorithms 

From the derivation in Section |3.4[ we have the generic algorithm for AFFECT 
at each time step shown in Fig. [5] We provide some details and interpretation of 
this generic algorithm when used with three popular static clustering algorithms: 
agglomerative hierarchical clustering, k-means, and spectral clustering. 

4.1 Agglomerative hierarchical clustering 

The proposed evolutionary extension of agglomerative hierarchical clustering has 
an interesting interpretation in terms of the modified cost function defined in ([6]). 
Recall that agglomerative hierarchical clustering is a greedy algorithm that merges 
the two clusters with the lowest dissimilarity at each iteration. The dissimilarity 
between two clusters can be interpreted as the cost of merging them. Thus, per- 
forming agglomerative hierarchical clustering on results in merging the two 
clusters with the lowest modified cost at each iteration. The snapshot cost of a 
merge corresponds to the cost of making the merge at time t using the dissimi- 
larities given by VF*. The temporal cost of a merge is a weighted combination of 
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the costs of making the merge at each time step s G {0,l,...,t — 1} using the 
dissimilarities given by W^. This can be seen by expanding the recursive update 
in ^ to obtain 

= (1 - a*) W' + a\l- a'-')W'-' + a^a*"^ (l - a*-^) W'-^ 
4.2 k-means 

k-means is an iterative clustering algorithm and requires an initial set of cluster 
memberships to begin the iteration. In static k-means, typically a random initial- 
ization is employed. A good initialization can significantly speed up the algorithm 
by reducing the number of iterations required for convergence. For evolutionary k- 
means, an obvious choice is to initialize using the clustering result at the previous 
time step. We use this initialization in our experiments in Section [5] 

The proposed evolutionary k-means algorithm can also be interpreted as opti- 
mizing the modified cost function of ([6]). The snapshot cost is V (X*, C*) where 
V{-, •) is the sum of squares cost defined in ([T]). The temporal cost is a weighted 
combination of V (X*, C^) , s G {0, 1, . . . , t — 1}, i.e. the cost of the clustering 
result applied to the data at time s. Hence the modified cost measures how well the 
current clustering result fits both current and past data. 



4.3 Spectral clustering 



The proposed evolutionary average association spectral clustering algorithm in- 
volves computing and discretizing eigenvectors of ^* rather than W^. It can also 
be interpreted in terms of the modified cost function of ([6]). Recall that the cost in 
static average association spectral clustering is tr {Zi^WZ) . Performing average 
association spectral clustering on ^* optimizes 



tr 



/ s=0 



(15) 



where corresponds to the coefficient in front of in ([T4]l. Thus, the snapshot 
cost is simply tr [Z^W^Z) while the temporal cost corresponds to the remaining 
t terms in ([15]). We note that in the case where a*^^ = 0, this modified cost is 
identical to that of PCQ, which incorporates historical data from time t — \ only. 
Hence our proposed generic framework reduces to PCQ in this special case. 

[Chi et al.] (2009 1 noted that PCQ can easily be extended to accommodate longer 
history and suggested to do so by using a constant exponentially weighted forget- 
ting factor. Our proposed framework uses an adaptive forgetting factor, which 
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Figure 6: Adding and removing objects over time. Shaded rows and columns are 
to be removed before computing ^f*. The rows and columns for the new objects 
are then appended to 

should improve clustering performance, especially if the rate at which the statisti- 
cal properties of the data are evolving is time- varying. 

Evolutionary ratio cut and normalized cut spectral clustering can be performed 
by forming the appropriate graph Laplacian, L* or £*, respectively, using in- 
stead of W^. They do not admit any obvious interpretation in terms of a modified 
cost function since they operate on L* and £* rather than W^. 

4.4 Practical issues 

4.4.1 Adding and removing objects over time 

Up to this point, we have assumed that the same objects are being observed at 
multiple time steps. In many application scenarios, however, new objects are often 
introduced over time while some existing objects may no longer be observed. In 
such a scenario, the indices of the proximity matrices and correspond to 
different objects, so one cannot simply combine them as described in (|8]). 

These types of scenarios can be dealt with in the following manner. Objects 
that were observed at time t — l but not at time t can simply be removed from 
in (J8]l. New objects introduced at time t have no corresponding rows and columns 
in These new objects can be naturally handled by adding rows and columns 

to ^* after performing the smoothing operation in ([8]). In this way, the new nodes 
have no influence on the update of the forgetting factor a* yet contribute to the 
clustering result through W. This process is illustrated graphically in Fig.p^ 
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4.4.2 Selecting the number of clusters 



The task of optimally choosing the number of clusters at each time step is a difficult 
model selection problem that is beyond the scope of this paper. However, since the 
proposed framework involves simply forming a smoothed proximity matrix fol- 
lowed by static clustering, heuristics used for selecting the number of clusters in 
static clustering can also be used with the proposed evolutionary clustering frame- 
work. One such heuristic applicable to many clustering algorithms is to choose the 



number of clusters to maximize the average silhouette width (Rousseeuw 19871. 
For hierarchical clustering, selection of the number of clusters is often accom- 



plished using a stopping rule; a review of many such rules can be found in Milligan 
and Cooper| ( |1985[ l. The eigengap heuristic ( |von Luxburg[ |2007] ) and the modular- 
ity criterion ( [Newman 2006| l are commonly used heuristics for spectral clustering. 
Any of these heuristics can be employed at each time step to choose the number of 
clusters, which can change over time. 



4.4.3 Matching clusters between time steps 

While the AFFECT framework provides a clustering result at each time that is 
consistent with past results, one still faces the challenge of matching clusters at 
time t with those at times t — 1 and earlier. This requires permuting the clusters 
in the clustering result at time t. If a one-to-one cluster matching is desired, then 
the cluster matching problem can be formulated as a maximum weight matching 
between the clusters at time t and those at time t — 1 with weights corresponding to 
the number of common objects between clusters. The maximum weight matching 



can be found in polynomial time using the Hungarian algorithm ( Kuhn 1955 1. The 
more general cases of many-to-one (multiple clusters being merged into a single 
cluster) and one-to-many (a cluster splitting into multiple clusters) matching are 



beyond the scope of this paper. We refer interested readers to Greene et al. (2010 1 
and IBrodka et al. ( 2012j ), both of which specifically address the cluster matching 
problem. 



5 Experiments 



We investigate the performance of the proposed AFFECT framework in five ex- 
periments involving both synthetic and real data sets. Tracking performance is 



measured in terms of the MSE E 



which is the criterion we seek 



1^* _ I 

to optimize. Clustering performance is measured by the Rand index ( Rand[ 1971 1, 
which is a quantity between and 1 that indicates the amount of agreement between 
a clustering result and a set of labels, which are taken to be the ground truth. A 
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Figure 7: Comparison of MSB in well-separated Gaussians experiment. The adap- 
tively estimated forgetting factor outperforms the constant forgetting factors and 
achieves MSB very close to the oracle forgetting factor. 



higher Rand index indicates higher agreement, with a Rand index of 1 correspond- 
ing to perfect agreement. We run at least one experiment for each of hierarchical 
clustering, k-means, and spectral clustering and compare the performance of AF- 
FECT against three recently proposed evolutionary clustering methods discussed 
in Section [2I3| RG, PCQ, and PCM. We run three iterations of AFFBCT unless 
otherwise specified. 



5.1 Well-separated Gaussians 

This experiment is designed to test the tracking ability of AFFBCT. We draw 40 
samples equally from a mixture of two 2-D Gaussian distributions with mean 
vectors (4,0) and (—4,0) and with both covariance matrices equal to 0.1/. At 
each time step, the means of the two distributions are moved according to a one- 
dimensional random walk in the first dimension with step size 0.1, and a new sam- 



ple is drawn with the component memberships fixed, as described in Section 3.3 
At time 19, we change the covariance matrices to 0.3/ to test how well the frame- 
work can respond to a sudden change. 

We run this experiment 100 times over 40 time steps using evolutionary k- 
means clustering. The two clusters are well-separated so even static clustering is 
able to correctly identify them. However the tracking performance is improved 
significantly by incorporating historical data, which can be seen in Fig. [7] where 
the MSB between the estimated and true similarity matrices is plotted for several 
choices of forgetting factor, including the estimated a*. We also compare to the 
oracle a*, which can be calculated using the true moments and cluster memberships 
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Figure 8: Comparison of oracle and estimated forgetting factors in well-separated 
Gaussians experiment. The gap between the estimated and oracle forgetting factors 
decreases as the sample size increases. 



of the data as shown in Appendix [Ajbut is not implementable in a real application. 
Notice that the estimated a* performs very well, and its MSB is very close to that 
of the oracle a*. The estimated a* also outperforms all of the constant forgetting 
factors. 

The estimated a* is plotted as a function of time in Fig. 8(a) Since the clusters 
are well-separated, only a single iteration is performed to estimate a*. Notice that 
both the oracle and estimated forgetting factors quickly increase from then level 
off to a nearly constant value until time 19 when the covariance matrix is changed. 
After the transient due to the change in covariance, both the oracle and estimated 
forgetting factors again level off. This behavior is to be expected because the two 
clusters are moving according to random walks. Notice that the estimated q* does 
not converge to the same value the oracle q* appears to. This bias is due to the finite 
sample size. The estimated and oracle forgetting factors are plotted in Fig. 8(b) for 
the same experiment but with 200 samples rather than 40. The gap between the 
steady-state values of the estimated and oracle forgetting factors is much smaller 
now, and it continues to decrease as the sample size increases. 



5.2 Two colliding Gaussians 

The objective of this experiment is to test the effectiveness of the AFFECT frame- 
work when a cluster moves close enough to another cluster so that they overlap. We 
also test the ability of the framework to adapt to a change in cluster membership. 

The setup of this experiment is illustrated in Fig. [9] We draw 40 samples from 
a mixture of two 2-D Gaussian distributions, both with covariance matrix equal to 
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Figure 9: Setup of two colliding Gaussians experiment: one cluster is slowly 
moved toward the other, then a change in cluster membership is simulated. 



identity. The mixture proportion (the proportion of samples drawn from the second 
cluster) is initially chosen to be 1/2. The first cluster has mean (3, 3) and remains 
stationary throughout the experiment. The second cluster's mean is initially at 
(—3, —3) and is moved toward the first cluster from time steps to 9 by (0.4, 0.4) 
at each time. At times 1 and 11 , we switch the mixture proportion to 3/8 and 1/4, 
respectively, to simulate objects changing cluster. From time 12 onwards, both the 
cluster means and mixture proportion are unchanged. At each time, we draw a new 
sample. 

We run this experiment 100 times using evolutionary k-means clustering. The 



MSE in this experiment for varying a* is shown in Fig. 10 As before, the ora- 
cle a* is calculated using the true moments and cluster memberships and is not 
implementable in practice. It can be seen that the choice of a* affects the MSE 
significantly. The estimated a* performs the best, excluding the oracle a*, which 
is not implementable. Notice also that q* = 0.5 performs well before the change 
in cluster memberships at time 10, i.e. when cluster 2 is moving, while a* = 0.75 
performs better after the change when both clusters are stationary. 



The clustering accuracy for this experiment is plotted in Fig. 1 1 Since this 



experiment involves k-means clustering, we compare to the RG method. We simu- 
late two filter lengths for RG: a short-memory 3rd-order filter and a long-memory 



lOth-order filter. In Fig. 11 it can be seen that the estimated a* also performs best 
in Rand index, approaching the performance of the oracle a*. The static method 
performs poorly as soon as the clusters begin to overlap at around time step 7. All 
of the evolutionary methods handle the overlap well, but the RG method is slow to 
respond to the change in clusters, especially the long-memory filter. In Table [T] we 
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Figure 10: Comparison of MSB in two colliding Gaussians experiment. The esti- 
mated a* performs best both before and after the change points. 
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Figure 1 1 : Comparison of Rand index in two colliding Gaussians experiment. The 
estimated a* detects the changes in clusters quickly unlike the RG method. 
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Table 1 : Means and standard errors of k-means Rand indices in two colliding Gaus- 
sians experiment. Bolded number indicates best performer within one standard 
error. 



Method 


Parameters 


Rand index 


Static 




0.899 ±0.002 




Estimated a* (3 iterations) 


0.984 ±0.001 


AFFECT 


Estimated a* (1 iteration) 


0.978 ±0.001 




a* = 0.5 


0.975 ±0.001 


RG 


/ = 3 


0.955 ±0.001 


I = 10 


0.861 ±0.001 




Figure 12: Comparison of oracle and estimated forgetting factors in two colliding 
Gaussians experiment. There is no noticeable change after the third iteration. 



present the means and standard errors (over the simulation runs) of the mean Rand 
indices of each method over all time steps. For AFFECT, we also show the Rand 
index when only one iteration is used to estimate a* and when arbitrarily setting 
a* = 0.5, both of which also outperform the RG method in this experiment. The 
poorer performance of the RG method is to be expected because it places more 
weight on time steps where the cluster centroids are well-separated, which again 
results in too much weight on historical data after the cluster memberships are 
changed. 

The estimated a* is plotted by iteration in Fig. 12 along with the oracle a*. 
Notice that the estimate gets better over the first three iterations, while the fourth 
and fifth show no visible improvement. The plot of the estimated a* suggests why 
it is able to outperform the constant a*'s. It is almost constant at the beginning of 
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the experiment when the second cluster is moving, then it decreases over the two 
times when cluster memberships are changed, and finally it increases when the 
two clusters are both stationary. The values of the oracle a* before and after the 
change corroborate the previous observation that a* = 0.5 performs well before 
the change, but a* = 0.75 performs better afterwards. Notice that the estimated a* 
appears to converge to a lower value than the oracle a*. This is once again due to 



the finite-sample effect discussed in Section 5.1 



5.3 Flocks of boids 

This experiment involves simulation of a natural phenomenon, namely the flocking 
behavior of birds. To simulate this phenomenon we use the bird-oid objects (boids) 
model proposed by Reynolds ( 1987[ ). The boids model allows us to simulate natural 



movements of objects and clusters. The behavior of the boids are governed by three 
main rules: 

1. Boids try to fly towards the average position (centroid) of local flock mates. 

2. Boids try to keep a small distance away from other boids. 

3. Boids try to fly towards the average heading of local flock mates. 

Our implementation of the boids model is based on the pseudocode of 'Parker 



(2007 1. At each time step, we move each bold 1/100 of the way towards the aver- 
age position of local flock mates, double the distance between boids that are within 
10 units of each other, and move each bold 1/8 of the way towards the average 
heading. 

We run two experiments using the boids model; one with a fixed number of 
flocks over time and one where the number of flocks varies over time. 



5.3.1 Fixed number of flocks 

Four flocks of 25 boids are initially distributed uniformly in separate 60 x 60 x 60 
cubes. To simulate boids moving continuously in time while being observed at 
regular time intervals, we allow each bold to perform five movements per time step 
according to the aforementioned rules. Similar to [Reynolds ( |1987 1, we use goal 



setting to push the flocks along parallel paths. Note that unlike in the previous 
experiments, the flocking behavior makes it possible to simulate natural changes in 
cluster, simply by changing the flock membership of a bold. We change the flock 
memberships of a randomly selected bold at each time step. The initial and final 



positions of the flocks for one realization are shown in Fig. 13 
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Figure 13: Setup of boids experiment: four flocks fly along parallel paths (start and 
end positions shown). At each time step, a randomly selected boid joins one of the 
other flocks. 
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Figure 14: Comparison of complete linkage Rand index in boids experiment. The 
estimated a* performs much better than static clustering and slightly better than 
the RG method. 



Table 2: Means and standard errors of complete linkage Rand indices in boids 
experiment. 



Method 


Parameters 


Rand index 


Static 




0.908 ±0.001 




Estimated a* (3 iterations) 


0.950 ±0.001 


AFFECT 


Estimated a* (1 iteration) 


0.945 ±0.001 




a* = 0.5 


0.945 ±0.001 


RG 


/ = 3 


0.942 ±0.001 


/ = 10 


0.939 ± 0.000 
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We run this experiment 100 times using complete linkage hierarchical cluster- 
ing. Unlike in the previous experiments, we do not know the true proximity matrix 
so MSE cannot be calculated. Clustering accuracy, however, can still be computed 
using the true flock memberships. The clustering performance of the various ap- 
proaches is displayed in Fig. [14] Notice that AFFECT once again performs better 
than RG, both with short and long memory, although the difference is much smaller 
than in the two colliding Gaussians experiment. The means and standard errors of 
the Rand indices for the various methods are listed in Table[2] Again, it can be seen 
that AFFECT is the best performer. The estimated a* in this experiment is roughly 
constant at around 0.6. This is not a surprise because all movements in this experi- 
ment, including changes in clusters, are smooth as a result of the flocking motions 
of the boids. This also explains the good performance of simply choosing a* = 0.5 
in this particular experiment. 



5.3.2 Variable number of flocks 

The difference between this second boids experiment and the first is that the num- 
ber of flocks changes over time in this experiment. Up to time 16, this experiment 
is identical to the previous one. At time 17, we simulate a scattering of the flocks 
by no longer moving them toward the average position of local flock mates as well 
as increasing the distance at which boids repel each other to 20 units. The boids 
are then rearranged at time 19 into two flocks rather than four. 

We run this experiment 100 times. The RG framework cannot handle changes 
in the number of clusters over time, thus we switch to normalized cut spectral 
clustering and compare AFFECT to PCQ and PCM. The number of clusters at each 



time step is estimated using the modularity criterion (Newman 2006 1. PCQ and 
PCM are not equipped with methods for selecting a. As a result, for each run of 
the experiment, we first performed a training run where the true flock memberships 
are used to compute the Rand index. The a which maximizes the Rand index is 
then used for the test run. 



The clustering performance is shown in Fig. 15 The Rand indices for all meth- 
ods drop after the flocks are scattered, which is to be expected. Shortly after the 
boids are rearranged into two flocks, the Rand indices improve once again as the 
flocks separate from each other. AFFECT once again outperforms the other meth- 
ods, which can also be seen from the summary statistics presented in Table [3] 
The performance of PCQ and PCM with both the trained a and arbitrarily chosen 
a = 0.5 are listed. Both outperform static clustering but perform noticeably worse 



than AFFECT with estimated a . From Fig. 15 it can be seen that the estimated a 



best responds to the rearrangement of the flocks. The estimated forgetting factor by 
iteration is shown in Fig. 16 Notice that the estimated a* drops when the flocks are 
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Figure 15: Comparison of spectral clustering Rand index in boids experiment. The 
estimated a* outperforms static clustering, PCQ, and PCM. 



Table 3: Means and standard errors of spectral clustering Rand indices in boids 
experiment. 



Method 


Parameters 


Rand index 


Static 




0.767 ±0.001 




Estimated a* (3 iterations) 


0.921 ±0.001 


AFFECT 


Estimated a* (1 iteration) 


0.921 ±0.001 




a* = 0.5 


0.873 ± 0.002 


PCQ 


Trained a 


0.779 ±0.001 


a = 0.5 


0.779 ±0.001 


PCM 


Trained a 


0.840 ± 0.002 


a = 0.5 


0.811 ±0.001 
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Time step 

Figure 16: Comparison of estimated spectral clustering forgetting factor by itera- 
tion in boids experiment. The estimated forgetting factor drops at the change point, 
i.e. when the flocks are scattered. There is no noticeable change in the forgetting 
factor after the second iteration. 




Figure 17: Comparison of number of clusters detected by spectral clustering in 
boids experiment. Using the estimated a* results in the best estimates of the num- 
ber of flocks (4 before the change point and 2 after). 
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scattered. Notice also that the estimates of a* hardly change after the first iteration, 
hence why performing one iteration of AFFECT achieves the same mean Rand in- 
dex as performing three iterations. Unlike in the previous experiments, a* = 0.5 
does not perform well in this experiment. 

Another interesting observation is that the most accurate estimate of the num- 



ber of clusters at each time is obtained when using AFFECT, as shown in Fig. 17 
Prior to the flocks being scattered, using AFFECT, PCQ, or PCM all result in good 
estimates for the number of clusters, while using the static method results in over- 
estimates. However, after the rearrangement of the flocks, the number of clusters is 
only accurately estimated when using AFFECT, which partially contributes to the 
poorer Rand indices of PCQ and PCM after the rearrangement. 

5.4 MIT Reality Mining 

The objective of this experiment is to test the proposed framework on a real data 
set with objects entering and leaving at different time steps. The experiment is con- 



ducted on the MIT Reality Mining data set (Eagle et al. 2009 1. The data was col- 
lected by recording cell phone activity of 94 students and staff at MIT over a year. 
Each phone recorded the Media Access Control (MAC) addresses of nearby Blue- 
tooth devices at five-minute intervals. Using this device proximity data, we con- 
struct a similarity matrix where the similarity between two students corresponds to 
the number of intervals where they were in physical proximity. We divide the data 
into time steps of one week, resulting in 46 time steps between August 2004 and 
June 2005. 

In this data set we have partial ground truth. Namely we have the affiliations 



of each participant. Eagle et al. ( 2009 1 found that two dominant clusters could 



be identified from the Bluetooth proximity data, corresponding to new students 
at the Sloan business school and coworkers who work in the same building. The 
affiliations are likely to be representative of the cluster structure, at least during the 
school year. 

We perform normalized cut spectral clustering into two clusters for this exper- 
iment and compare AFFECT with PCQ and PCM. Since this experiment involves 
real data, we cannot simulate training sets to select a for PCQ and PCM. Instead, 
we use 2-fold cross-validation, which we believe is the closest substitute. A com- 
parison of clustering performance is given in Table|4] Both the mean Rand indices 
over the entire 46 weeks and only during the school year are listed. AFFECT is the 
best performer in both cases. Surprisingly, PCQ barely performs better than static 
spectral clustering with the cross-validated a and even worse than static spectral 
clustering with a = 0.5. PCM fares better than PCQ with the cross-validated a but 
also performs worse than static spectral clustering with a = 0.5. We beheve this is 
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Table 4: Mean spectral clustering Rand indices for MIT Reality Mining experi- 
ment. Bolded number denotes best performer in each category. 



Method 


Parameters 


Rand index 


Entire trace 


School year 


Static 




0.853 


0.905 




Estimated a* (3 iterations) 


0.893 


0.953 


AFFECT 


Estimated a* (1 iteration) 


0.891 


0.953 




a* = 0.5 


0.882 


0.949 


PCQ 


Cross- validated a 


0.856 


0.905 


a = 0.5 


0.788 


0.854 


PCM 


Cross- validated a 


0.866 


0.941 


a = 0.5 
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Figure 18: Estimated a* over entire MIT Reality Mining data trace. Six important 
dates are indicated. The sudden drops in the estimated a* indicate change points in 
the network. 



31 











V ■ ^ : ■ ■ , V 

' r. ■■ w . ■ ■ 

1 / 













20 40 60 80 



20 
40 
60 
80 











J.-'. J. i 











20 40 60 80 



Figure 19: Cluster structure before (left) and after (right) beginning of winter break 
in MIT Reality Mining data trace. Darker entries correspond to greater time spent 
in physical proximity. The empty cluster to the upper left consists of inactive par- 
ticipants during the time step. 



due to the way PCQ and PCM suboptimally handle objects entering and leaving at 
different time steps by estimating previous similarities and memberships, respec- 



tively. On the contrary, the method used by AFFECT, described in Section 4.4. 1 
performs well even with objects entering and leaving over time. 



The estimated a* is shown in Fig. 18 Six important dates are labeled. The 



start and end dates of the terms were taken from the MIT academic calendar (MIT^ 



WWW I to be the first and last day of classes, respectively. Notice that the estimated 



a* appears to drop around several of these dates. These drops suggest that phys- 
ical proximities changed around these dates, which is reasonable, especially for 
the students because their physical proximities depend on their class schedules. 
For example, the similarity matrices at time steps 18 and 19, before and after the 
beginning of winter break, are shown in Fig. 19 The detected clusters using the es- 
timated a* are superimposed onto both matrices, with rows and columns permuted 
according to the clusters. Notice that the similarities, corresponding to time spent 
in physical proximity of other participants, are much lower at time 19, particularly 
in the smaller cluster. The change in the structure of the similarity matrix, along 
with the knowledge that the fall term ended and the winter break began around this 
time, suggests that the low estimated forgetting factor at time 19 is appropriate. 
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5.5 NASDAQ stock prices 

In this experiment, we test the proposed framework on a larger time-evolving 
data set, namely stock prices. We examined the daily prices of stocks listed on 
the NASDAQ stock exchange in 2008 ( Inf ochimps- WWW] | . Using a time step of 



3 weeks (15 days in which the stock market is operational), we construct a 15- 
dimensional vector for each stock where the ith coordinate consists of the differ- 
ence between the opening prices at the {i + l)th and zth days. Each vector is then 
normalized by subtracting its sample mean then dividing by its sample standard de- 
viation. Thus each feature vector x- corresponds to the normalized derivatives of 
the opening price sequences over the tth 15-day period. This type of feature vector 



was found by Gavrilov et al. ( 2000 1 to provide the most accurate static clustering 
results with respect to the sectors of the stocks, which are taken to be the ground 
truth cluster labels ( jNASDAQ- WWWp . The number of stocks in each sector in the 
data set for this experiment are listed in Table|5] resulting in a total of 2, 095 stocks. 

We perform evolutionary k-means clustering into 12 clusters, corresponding 
to the number of sectors. The mean Rand indices for AFFECT, static clustering, 
and RG are shown in Table [6] along with standard errors over five random k-means 
initializations. Since the RG method cannot deal with objects entering and leav- 
ing over time, we only cluster the 2, 049 stocks listed for the entire year for the 
Rand index comparison. AFFECT is once again the best performer, although the 
improvement is smaller compared to the previous experiments. 

The main advantage of the AFFECT framework when applied to this data set 



is revealed by the estimated a*, shown in Fig. 20 One can see a sudden drop in the 
estimated a* at t = 13 akin to the drop seen in the MIT Reality Mining experiment 



in Section 5.4 The sudden drop suggests that there was a significant change in 



the true proximity matrix around this time step, which happens to align with 



the stock market crash that occurred in late September 2008 ( Yahoo-WWW i, once 



again suggesting the veracity of the downward shift in the value of the estimated 
a*. 

We also evaluate the scalability of the AFFECT framework by varying the num- 
ber of objects to cluster. We selected the top 100, 250, 500, 1,000, and 1,500 
stocks in terms of their market cap and compared the computation time of the AF- 
FECT evolutionary k-means algorithm to the static k-means algorithm. The mean 
computation times over ten runs on a Linux machine with a 3.00 GHz Intel Xeon 



processor are shown in Fig. 21 Notice that the computation time for AFFECT 
when running a single iteration is almost equivalent to that of static k-means. The 
AFFECT procedure consists of iterating between static clustering and estimating 
a*. The latter involves simply computing sample moments over the clusters, which 
adds minimal complexity. Thus by performing a single AFFECT iteration, one 
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Table 5: Number of stocks in each NASDAQ sector in 2008. The sectors are taken 
to be the ground truth cluster labels for computing Rand indices. 



Sector 


Basic Industries 


Capital Goods 


Consumer Durables 


Stocks 


61 


167 


188 


Sector 


Consumer Non-Durables 


Consumer Services 


Energy 


Stocks 


93 


261 


69 


Sector 


Finance 


Health Care 


Miscellaneous 


Stocks 


472 


199 


65 


Sector 


Pubhc Utihties 


Technology 


Transportation 


Stocks 


69 


402 


49 



Table 6: Means and standard errors (over five random initializations) of k-means 
Rand indices for NASDAQ stock prices experiment. 



Method 


Parameters 


Rand index 


Static 




0.801 ±0.000 




Estimated a* (3 iterations) 


0.808 ±0.000 


AFFECT 


Estimated a* (1 iteration) 


0.806 ± 0.000 




a* = 0.5 


0.806 ±0.000 


RG 


/ = 3 


0.804 ± 0.000 


I = 10 


0.806 ±0.001 







/ Stock market crash 





5 10 15 

Time step 

Figure 20: Estimated a* over NASDAQ stock opening prices in 2008. The sudden 
drop aligns with the stock market crash in late September. 
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Number of stocks 

Figure 2 1 : Computation times of AFFECT k-means and static k-means for varying 
numbers of stocks. The estimation of a* in AFFECT adds hardly any computation 
time. 

can achieve better clustering performance, as shown in Table |6] with almost no 
increase in computation time. Notice also that the computation time of running a 
single AFFECT iteration when all 2, 095 stocks are clustered is actually less than 
that of static k-means. This is due to the iterative nature of k-means; clustering on 
the smoothed proximities results in faster convergence of the k-means algorithm. 
As the number of objects increases, the decrease in the computation time due to 
faster k-means convergence is greater than the increase due to estimating a*. The 
same observations apply for 3 iterations of AFFECT when compared to 3 times the 
computation time for static clustering (labeled as "3 x static clustering"). 

6 Conclusion 

In this paper we proposed a novel adaptive framework for evolutionary clustering 
by performing tracking followed by static clustering. The objective of the frame- 
work was to accurately track the true proximity matrix at each time step. This 
was accomplished using a recursive update with an adaptive forgetting factor that 
controlled the amount of weight to apply to historic data. We proposed a method 
for estimating the optimal forgetting factor in order to minimize mean squared 
tracking error. The main advantages of our approach are its universality, allowing 
almost any static clustering algorithm to be extended to an evolutionary one, and 



35 



that it provides an explicit method for selecting the forgetting factor, unlike exist- 
ing methods. The proposed framework was evaluated on several synthetic and real 
data sets and displayed good performance in tracking and clustering. It was able 
to outperform both static clustering algorithms and existing evolutionary clustering 
algorithms. 

There are many interesting avenues for future work. In the experiments pre- 
sented in this paper, the estimated forgetting factor appeared to converge after three 
iterations. We intend to investigate the convergence properties of this iterative pro- 
cess in the future. In addition, we would like to improve the finite-sample behavior 
of the estimator. Finally, we plan to investigate other loss functions and models for 
the true proximity matrix. We chose to optimize MSB and work with a block model 
in this paper, but perhaps other functions or models may be more appropriate for 
certain applications. 



Appendix A True similarity matrix for dynamic Gaussian 
mixture model 

We derive the true similarity matrix ^ and the matrix of variances of similarities 
var(Ty), where the similarity is taken to be the dot product, for data sampled from 
the dynamic Gaussian mixture model described in Section 3.3 These matrices are 
required in order to calculate the oracle forgetting factor for the experiments in 
Sections [5TT] and 5.2 We drop the superscript t to simplify the notation. 

Consider two arbitrary objects Xj ~ N{fic, ^c) and Xj ~ N{fid, ^d) where 
the entries of /Xc and Sc are denoted by and acki, respectively. For any distinct 
i,j the mean is 

p p 

k=l k=l 

and the variance is 



var 



(xixj) = E [(xixj)'] - E [xixj]' 

p p 

= ^ ^ {E [xikXjkXiiXji] - ^lck^J'dk^J'dfJ'dl} 

k=l 1=1 

p p 

= ^ ^ {(0"cM + IJ'cklJ'cl) {(^dkl + fJ-dkfJ-dl) — ^^ck^^dk^J'cl^^dl} 
k=l 1=1 

P P 

= ^ ^ WcklO'dkl + O'ckltJ'dkfJ'dl + <^dkll^ckl^cl} 



k=l 1=1 
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by independence of Xj and Xj. This holds both for x,;,Xj in the same cluster, 
i.e. c = d, and for Xj, Xj in different clusters, i.e. d. Along the diagonal. 



p p 



E [x,xf ] = ^ E [xffc] = (^cfcfc + ■ 

k=l k=l 

The calculation for the variance is more involved. We first note that 

[ 221222 2 2 

^ik^ili = f^ckf^d + fJ-ck^^cll + 'ifJ'ckfJ'clf^ckl + fJ-d^^ckk + '^(^ckl + (^ckkCTdl, 

which can be derived from the characteristic function of the multivariate Gaussian 



distribution (Anderson 20031. Thus 



p p 

var (x,xf ) = X] [^Ik^l] - [^ckk + fJ-lk) {'^dl + fJ-li) } 

k=l 1=1 
P P 

= X] X] {"^f^ckfJ-dfTcki + 2a'^ki} ■ 



k=l 1=1 



The calculated means and variances are then substituted into ( [13] ) to compute the 
oracle forgetting factor. Since the expressions for the means and variances depend 
only on the clusters and not any objects in particular, it is confirmed that both ^ 
and var(VF) do indeed possess the assumed block structure discussed in Section 
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